Classification with unknown class-conditional label noise on non-compact feature spaces
We investigate the problem of classification in the presence of unknown
class-conditional label noise in which the labels observed by the learner have
been corrupted with some unknown class-dependent probability. In order to
obtain finite sample rates, previous approaches to classification with unknown
class-conditional label noise have required that the regression function be
close to its extrema on sets of large measure. We shall consider this problem
in the setting of non-compact metric spaces, where the regression function need
not attain its extrema.
In this setting we determine the minimax optimal learning rates (up to
logarithmic factors). The rate displays interesting threshold behaviour: When
the regression function approaches its extrema at a sufficient rate, the
optimal learning rates are of the same order as those obtained in the
label-noise free setting. If the regression function approaches its extrema
more gradually then classification performance necessarily degrades. In
addition, we present an adaptive algorithm which attains these rates without
prior knowledge of either the distributional parameters or the local density.
This identifies for the first time a scenario in which finite sample rates are
achievable in the label noise setting, but they differ from the optimal rates
without label noise.
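As a minimal illustration of the corruption model described in this abstract (not of the paper's adaptive algorithm), class-conditional label noise can be simulated by flipping each label with a probability that depends on its true class; the function name and noise levels below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(y, rho_pos, rho_neg, rng):
    """Flip each binary label with a class-dependent probability:
    positives (y=1) flip with prob rho_pos, negatives (y=0) with rho_neg.
    The learner only ever observes the corrupted labels."""
    y = np.asarray(y)
    flip = np.where(y == 1,
                    rng.random(y.shape) < rho_pos,
                    rng.random(y.shape) < rho_neg)
    return np.where(flip, 1 - y, y)

# Clean labels, and the noisy view the learner actually receives.
y_clean = rng.integers(0, 2, size=10000)
y_noisy = corrupt_labels(y_clean, rho_pos=0.3, rho_neg=0.1, rng=rng)

# Empirical flip rates should track the class-conditional noise levels.
flip_pos = np.mean(y_noisy[y_clean == 1] != 1)
flip_neg = np.mean(y_noisy[y_clean == 0] != 0)
```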
Robust mixtures in the presence of measurement errors
We develop a mixture-based approach to robust density modeling and outlier
detection for experimental multivariate data that includes measurement error
information. Our model is designed to infer atypical measurements that are not
due to errors, aiming to retrieve potentially interesting peculiar objects.
Since exact inference is not possible in this model, we develop a
tree-structured variational EM solution. This compares favorably against a
fully factorial approximation scheme, approaching the accuracy of a
Markov-Chain-EM, while maintaining computational simplicity. We demonstrate the
benefits of including measurement errors in the model, in terms of improved
outlier detection rates in varying measurement uncertainty conditions. We then
use this approach in detecting peculiar quasars from an astrophysical survey,
given photometric measurements with errors.
Comment: (Refereed) Proceedings of the 24th Annual International Conference
on Machine Learning 2007 (ICML07), (Ed.) Z. Ghahramani. June 20-24, 2007,
Oregon State University, Corvallis, OR, USA, pp. 847-854; Omnipress. ISBN
978-1-59593-793-3; 8 pages, 6 figures
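The modelling idea (folding known per-point measurement-error variances into the mixture likelihood, rather than ignoring them) can be sketched with a toy one-dimensional EM. This is a deliberately simplified stand-in, not the paper's tree-structured variational scheme; the variance-correction heuristic and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two well-separated 1-D clusters, each observation carrying its
# own known measurement-error variance s2_i (heteroscedastic noise).
n = 400
z = rng.integers(0, 2, n)
mu_true = np.array([-3.0, 3.0])
s2 = rng.uniform(0.1, 0.5, n)                      # per-point error variances
x = mu_true[z] + rng.normal(0.0, 1.0, n) + rng.normal(0.0, np.sqrt(s2))

def em_mixture_with_errors(x, s2, n_iter=50):
    """EM for a 2-component Gaussian mixture whose effective variance for
    point i is (component variance + s2_i), so the measurement-error
    information enters the likelihood explicitly."""
    mu = np.array([x.min(), x.max()])              # crude initialisation
    var = np.ones(2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities under error-inflated variances.
        tot = var[None, :] + s2[:, None]           # shape (n, 2)
        logp = (np.log(pi)[None, :]
                - 0.5 * np.log(2 * np.pi * tot)
                - 0.5 * (x[:, None] - mu[None, :]) ** 2 / tot)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates; the component variance subtracts the
        # average measurement-error contribution (simple moment correction).
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
        var = np.maximum(var - (r * s2[:, None]).sum(axis=0) / nk, 1e-3)
        pi = nk / len(x)
    return mu, var, pi

mu_hat, var_hat, pi_hat = em_mixture_with_errors(x, s2)
```

Points whose responsibility is low under every component (after accounting for their stated errors) are then the candidate "peculiar" objects, rather than points that merely have large error bars.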
Finding Young Stellar Populations in Elliptical Galaxies from Independent Components of Optical Spectra
Elliptical galaxies are believed to consist of a single population of old
stars formed together at an early epoch in the Universe, yet recent analyses of
galaxy spectra seem to indicate the presence of significant younger populations
of stars in them. The detailed physical modelling of such populations is
computationally expensive, inhibiting the detailed analysis of the several
million galaxy spectra becoming available over the next few years. Here we
present a data mining application aimed at decomposing the spectra of
elliptical galaxies into several coeval stellar populations, without the use of
detailed physical models. This is achieved by performing a linear independent
basis transformation that essentially decouples the initial problem of joint
processing of a set of correlated spectral measurements into that of the
independent processing of a small set of prototypical spectra. Two methods are
investigated: (1) a fast projection approach, derived by exploiting the
correlation structure of neighboring wavelength bins within the spectral data;
and (2) a factorisation method that takes advantage of the positivity of the
spectra. The preliminary results show that typical
features observed in stellar population spectra of different evolutionary
histories can be convincingly disentangled by these methods, despite the
absence of input physics. The success of this basis transformation analysis in
recovering physically interpretable representations indicates that this
technique is a potentially powerful tool for astronomical data mining.Comment: 12 Pages, 7 figures; accepted in SIAM 2005 International Conference
on Data Mining, Newport Beach, CA, April 200
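The positivity-exploiting factorisation route (method 2) can be illustrated with plain multiplicative-update non-negative matrix factorisation on synthetic non-negative "spectra". This is a generic sketch of the idea, not the specific factorisation used in the paper, and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the galaxy-spectra setting: each observed spectrum
# (a row of V) is a non-negative mixture of a few prototypical components.
n_spectra, n_bins, k = 50, 200, 3
W_true = rng.random((n_spectra, k))
H_true = rng.random((k, n_bins))
V = W_true @ H_true

def nmf(V, k, n_iter=500, eps=1e-9):
    """Multiplicative-update NMF (Lee-Seung style): factorise V ~ W @ H
    with W, H kept elementwise non-negative throughout."""
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(V, k)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The rows of H play the role of the prototypical spectra, and each observed spectrum is explained by its non-negative mixing weights in W, mirroring the decomposition into coeval stellar populations.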
Dimension-free error bounds from random projections
Learning from high dimensional data is challenging in general; however, the data is often not truly high dimensional, in the sense that it may have some hidden low-complexity geometry. We give new, user-friendly PAC-bounds that are able to take advantage of such benign geometry to reduce the dimension-dependence of error guarantees in settings where such dependence is known to be essential in general. This is achieved by employing random projection as an analytic tool, and exploiting its structure-preserving compression ability. We introduce an auxiliary function class that operates on reduced-dimensional inputs, together with a new complexity term, defined as the distortion of the loss under random projections. The latter is a hypothesis-dependent data-complexity, whose analytic estimates turn out to recover various regularisation schemes in parametric models, and a notion of intrinsic dimension, as quantified by the Gaussian width of the input support, in the case of the nearest neighbour rule. If benign geometry is present, the bounds become tighter; otherwise they recover the original dimension-dependent bounds.
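The structure-preserving compression that drives these bounds can be illustrated numerically: project data lying on a low-dimensional subspace of a high-dimensional ambient space with a scaled Gaussian matrix, and check that pairwise distances are only mildly distorted. This is a generic Johnson-Lindenstrauss-style sketch, not the paper's bound machinery; all dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

# Benign geometry: points lie on a random 5-dimensional subspace of R^1000,
# so the data is not "truly" high dimensional despite the ambient dimension.
n, D, d, k = 100, 1000, 5, 100
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, D))

# Gaussian random projection to k dimensions, scaled so that squared norms
# are preserved in expectation.
R = rng.normal(size=(D, k)) / np.sqrt(k)
Xp = X @ R

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)

# Distortion of pairwise (squared) distances under the projection.
mask = ~np.eye(n, dtype=bool)
ratios = pairwise_sq_dists(Xp)[mask] / pairwise_sq_dists(X)[mask]
max_distortion = np.abs(ratios - 1.0).max()
```

Because the points occupy only a 5-dimensional subspace, all pairwise distances survive the 1000-to-100 compression with modest distortion; for data that genuinely fills the ambient space, the same target dimension would distort far more, which is the mechanism behind the dimension-dependent fallback in the bounds.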